Bioinformatics - Active regulatory regions prediction using deep neural networks

Paolo Cortis

Application of deep supervised learning methods to identify the position of active/inactive cis-regulatory regions.

Note: The following code takes inspiration from the lab notebooks but was substantially deconstructed, revised, and customized for a better personal understanding, at times even at the cost of some of the compactness and encapsulation present in the original.

Libraries

Data retrieval: Epigenomic data

Load and visualize epigenomic data (features+labels) for both promoters and enhancers for cell line HepG2.

Binarize labels

The real-valued labels must be converted into discrete binary values to treat the problem as a binary classification task, so a threshold is selected to label regions as active or inactive.
Since this binarization is still an open problem and the literature provides neither a clear method to perform it nor a specific cut-off value, the threshold was chosen to reach a reasonable balance between the classes without discarding any data.
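The thresholding step can be sketched as follows (the `binarize_labels` helper, the toy scores, and the cut-off of 1.0 are illustrative assumptions, not the notebook's actual values):

```python
import pandas as pd

def binarize_labels(labels: pd.Series, threshold: float = 1.0) -> pd.Series:
    """Map real-valued activity scores to binary active/inactive labels."""
    return (labels > threshold).astype(int)

scores = pd.Series([0.0, 0.4, 1.2, 3.5, 0.9])
binary = binarize_labels(scores, threshold=1.0)
print(binary.tolist())  # [0, 0, 1, 1, 0]: regions with score > 1.0 are active
```

In practice the threshold would be chosen by inspecting the resulting class balance rather than fixed a priori.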

Promoters features

Promoters labels

Enhancers features

Enhancers labels

Data elaboration and visualization: Epigenomic data

Samples-features ratio

If the number of features is higher than the number of samples, the model is likely to behave erratically and overfit.
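A quick sanity check of this ratio can be sketched as below (the `check_sample_feature_ratio` helper and the heuristic of 10 samples per feature are illustrative assumptions):

```python
import numpy as np

def check_sample_feature_ratio(X: np.ndarray, min_ratio: float = 10.0) -> float:
    """Return samples-per-feature ratio, warning when it falls below a heuristic minimum."""
    n_samples, n_features = X.shape
    ratio = n_samples / n_features
    if ratio < min_ratio:
        print(f"Warning: only {ratio:.1f} samples per feature -> risk of overfitting")
    return ratio

X = np.random.rand(200, 50)
ratio = check_sample_feature_ratio(X)
print(ratio)  # 4.0: only 4 samples per feature, so the warning is triggered
```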

NaN values

NaN values have a detrimental impact on the performance of the model.

NaN values imputation

The amount of NaN values is small enough to allow imputation.
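A sketch of the imputation on a toy matrix, here using scikit-learn's `KNNImputer` (the notebook may use a different imputation strategy; the feature names are invented):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy feature matrix with a few missing entries
X = pd.DataFrame({
    "feat_a": [1.0, 2.0, np.nan, 4.0],
    "feat_b": [10.0, np.nan, 30.0, 40.0],
})

# Fill each NaN from the nearest neighbours in feature space
imputer = KNNImputer(n_neighbors=2)
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
print(X_imputed.isna().sum().sum())  # 0: no NaN values remain
```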

Data Visualization

Visualizing the data helps to evaluate the difficulty of the task and to design models of appropriate complexity. In this case, data decomposition techniques (PCA and t-SNE) are used to reduce the high-dimensional dataset to two dimensions, preserving spatial characteristics for visualization purposes.

In this portion of the code only epigenomic data is being pre-processed; for this visualization part, however, sequence data is also considered.

PCA

A technique that reduces the dimensionality of a dataset by linearly transforming the data into a new coordinate system in which most of the original variation can be described with fewer dimensions than the initial data.
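A minimal sketch of the decomposition with scikit-learn's `PCA` (the random matrix is a stand-in for the actual epigenomic feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 50))  # stand-in for the epigenomic feature matrix

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                            # (300, 2)
print(pca.explained_variance_ratio_.sum())   # fraction of variance the two axes retain
```

The two resulting columns are what gets plotted, coloured by the binary label.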

t-SNE

A non-linear dimensionality reduction technique, based on Stochastic Neighbour Embedding, well suited for embedding high-dimensional data in a low-dimensional space of two dimensions for visualization.
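A corresponding sketch with scikit-learn's `TSNE` (again on random stand-in data; note that perplexity must be smaller than the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))  # stand-in for the feature matrix

tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)  # (100, 2)
```

Unlike PCA, t-SNE has no `transform` for unseen data and its runtime grows quickly with sample size, which is why it is the bottleneck of the visualization loop below.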

The following loop plots PCA and t-SNE decomposition for epigenomic data and sequence data considering both promoters and enhancers.

Note: since the labels are binarized, the max/mean metric differentiation is redundant and produces the same result twice. However, given the computational time required to run t-SNE, this cell has not been modified in that regard.

Normalization via robust scaler

Data normalization is known to improve the performance of models.
The robust scaler normalizes the data by subtracting the median and dividing by the interquartile range (IQR). This performs a z-scoring-like standardization while being robust to outliers.
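A small sketch showing why the robust scaler is insensitive to outliers (the toy column with one extreme value is illustrative):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one extreme outlier

scaler = RobustScaler()  # subtracts the median, divides by the IQR
X_scaled = scaler.fit_transform(X)
print(np.median(X_scaled))  # 0.0: the scaled data is centred on the median
```

A standard z-score would be dominated by the outlier's contribution to the mean and standard deviation; the median and IQR are barely affected by it.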

Feature selection

Feature selection implies the choice of a subset of the original pool of features.
This is a useful step, as irrelevant or partially relevant features can negatively impact model performance; feature selection therefore reduces noise, improves model accuracy, and reduces training time by cutting the total number of features (currently quite high).

Correlation with the output tests

Features not correlated with the output do not carry relevant information for the model to exploit for the task at hand. These can be removed.

In the following section, functions for each of these tests are defined and run over the entire dataset. This is done here for explanatory purposes only: the resulting correlated features are not dropped from the dataset, as doing so would introduce bias due to data leakage. Instead, these tests are saved for the training "main loop" and run, for each holdout, over the training portion of the data only, keeping the test data unseen and reserved for actual testing.

Pearson correlation test
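A sketch of the test with `scipy.stats.pearsonr` on synthetic data (the feature and labels are invented for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)               # binary labels
feature = y + rng.normal(scale=0.5, size=200)  # feature linearly related to y

corr, p_value = pearsonr(feature, y)
print(f"r={corr:.2f}, p={p_value:.1e}")  # strong linear correlation, tiny p-value
```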

Spearman correlation test
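The Spearman test is rank-based, so it also captures monotone non-linear relationships; a sketch on synthetic data (the exponential transform is only there to make the relationship non-linear):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)                         # binary labels
feature = np.exp(y + rng.normal(scale=0.5, size=200))    # monotone, non-linear in y

corr, p_value = spearmanr(feature, y)
print(f"rho={corr:.2f}, p={p_value:.1e}")  # ranks are unchanged by exp, so rho stays high
```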

Maximal information coefficient

Drop features uncorrelated with the output

Correlation with the features tests

Features correlated with each other carry the same kind of information. These can be removed.

Pearson correlation test

Identifies linear correlation between features.
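The idea can be sketched with a pandas correlation matrix: for each highly correlated pair, one member is dropped (the 0.95 cut-off and the toy features are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = df["a"] * 2 + rng.normal(scale=0.01, size=100)  # near-duplicate of "a"
df["c"] = rng.normal(size=100)                            # independent feature

corr = df.corr(method="pearson").abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)  # ['b']: one member of each highly correlated pair is removed
df_reduced = df.drop(columns=to_drop)
```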

Boruta feature selection

The algorithm gives a numerical estimate of feature importance and categorizes features into those to keep, those to discard, and tentative ones, meaning that there is uncertainty regarding their actual correlation with the output. Both uncorrelated and tentative features are discarded. Experiments were also performed discarding only the strictly uncorrelated features; the results, however, were similar.

Learning

The technique used to generate the holdouts for the evaluation of the various architectures is the stratified Monte-Carlo method, which each time produces a different arrangement of the training, validation, and test sets while keeping roughly the same class balance.

10 holdouts are considered to train and evaluate each model; 20% of the whole dataset was used for testing and 80% for training, of which 20% is reserved for the validation of the meta-model.

To avoid excessively increasing the computational complexity of the main training loop, the hyperparameter tuning step is performed, for all holdouts, on a single split of the training set into a training and a validation portion.
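The stratified Monte-Carlo scheme described above maps directly onto scikit-learn's `StratifiedShuffleSplit`; a sketch on synthetic, imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) > 0.7).astype(int)  # imbalanced binary labels, ~30% positives

# 10 stratified Monte-Carlo holdouts; 20% of the data held out for testing each time
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
splits = list(splitter.split(X, y))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"holdout {fold}: test positive rate = {y[test_idx].mean():.2f}")
```

Each holdout shuffles independently, so unlike k-fold cross-validation the test sets may overlap across holdouts, but the class balance is preserved in every split.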

Evaluate a model

Next is a set of functions to evaluate and visualize the performance of models.

To obtain a statistically sound estimate of an architecture's performance, multiple models with the same architecture are built and trained over different portions of the data (holdouts), and their average performance is taken as an estimate of the overall performance of the architecture.
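The aggregation step can be sketched as below, with AUROC and AUPRC as the metrics (the per-holdout labels and predicted probabilities are synthetic stand-ins for real model outputs):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(4)
auroc_scores, auprc_scores = [], []

for holdout in range(10):
    y_true = rng.integers(0, 2, size=200)
    # Stand-in for the model's predicted probabilities on this holdout:
    # positives score higher on average, with deliberate overlap
    y_pred = y_true * 0.3 + rng.random(200) * 0.7
    auroc_scores.append(roc_auc_score(y_true, y_pred))
    auprc_scores.append(average_precision_score(y_true, y_pred))

print(f"AUROC: {np.mean(auroc_scores):.3f} +/- {np.std(auroc_scores):.3f}")
print(f"AUPRC: {np.mean(auprc_scores):.3f} +/- {np.std(auprc_scores):.3f}")
```

Reporting the standard deviation alongside the mean shows how stable the architecture is across holdouts.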

Hyperparameter tuning

Hyperparameter tuning is performed by training a meta-model over training and validation data to obtain the best set of hyperparameters via Bayesian optimization.

Here is the function to train the hypermodel.

MLP

A multi-layer perceptron is a feed-forward neural network consisting of an input layer, an output layer, and hidden dense layers in between. Dropout layers were added to avoid overfitting.

Here are the functions to build the hypermodel for hyperparameter tuning and to build the model with the best set of hyperparameters found.

Main training loop - MLP

Promoters

Wilcoxon test

The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test which checks the null hypothesis that two related paired samples come from the same distribution. Here, this test is used to determine whether there is any statistically significant difference in the results produced by two different model architectures.
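A sketch with `scipy.stats.wilcoxon` on hypothetical per-holdout scores (the numbers are invented; in the notebook the paired samples are the holdout-wise results of the two architectures):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-holdout scores of two architectures
scores_a = np.array([0.850, 0.872, 0.861, 0.883, 0.845,
                     0.866, 0.874, 0.852, 0.868, 0.880])
scores_b = scores_a - np.linspace(0.030, 0.060, 10)  # B consistently worse

stat, p_value = wilcoxon(scores_a, scores_b)
alpha = 0.01
if p_value < alpha:
    print(f"p={p_value:.4f}: statistically significant difference")
else:
    print(f"p={p_value:.4f}: no statistically significant difference")
```

Because the samples are paired (same holdouts for both architectures) and no normality is assumed, the signed-rank test is an appropriate choice here.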

Wilcoxon test [MLP_Promoters_Boruta VS MLP_Promoters_No_Boruta]

Enhancers

Wilcoxon test [MLP_Enhancers_Boruta VS MLP_Enhancers_No_Boruta]

Data retrieval: Genomic data

Load and visualize genomic sequence data for both promoters and enhancers for cell line HepG2.

NaN values

Check for NaN values: the characters n/N represent a missing nucleotide. In this case NaN values are ignored, as there are very few.
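The check amounts to counting n/N characters across the sequences; a sketch on toy sequences (the sequences are invented):

```python
# Fraction of unresolved nucleotides (n/N) in a set of sequences
sequences = ["ACGTACGT", "ACGNACGT", "NNACGTAC", "ACGTACGT"]

total = sum(len(seq) for seq in sequences)
unknown = sum(seq.upper().count("N") for seq in sequences)
print(f"{unknown}/{total} = {unknown / total:.2%} unknown nucleotides")
```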

CNN data sequence generation

The following step constructs the Mixed Sequence to train CNNs with DNA sequence data.
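The sequence generator ultimately feeds the CNN one-hot encoded nucleotides; the underlying encoding can be sketched as below (the `one_hot_encode` helper is illustrative, the notebook's generator handles this internally):

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def one_hot_encode(sequence: str) -> np.ndarray:
    """Encode a DNA sequence as a (length, 4) one-hot matrix; n/N rows stay all-zero."""
    encoding = np.zeros((len(sequence), 4), dtype=np.float32)
    for i, base in enumerate(sequence.upper()):
        if base in NUCLEOTIDES:
            encoding[i, NUCLEOTIDES.index(base)] = 1.0
    return encoding

print(one_hot_encode("ACGN"))
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 0.]]
```

Leaving unknown nucleotides as all-zero rows is one common convention; it lets the few n/N positions pass through without a dedicated symbol.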

CNN

A convolutional neural network consists of a set of convolutional layers (convolution + max pooling) followed by a set of dense layers. Dropout layers were added to avoid overfitting.

Here are the functions to evaluate the model performances, build and train the hypermodel for hyperparameter tuning, and to build and train the model with the best set of hyperparameters found.

Main training loop - CNN

Promoters

Wilcoxon test [MLP_Promoters VS CNN_Promoters]

Enhancers

Wilcoxon test [MLP_Enhancers VS CNN_Enhancers]

Multi modal neural network

A multi-modal neural network is a model capable of jointly representing and exploiting the information of both data modalities (epigenomic data and genomic sequence data). To build this network, the layers of an FFNN and a CNN are concatenated, each receiving its specific input; the outputs are then combined into a final dense layer and an output layer.

The MMNN is composed here in two different ways to inspect the possible impact on performance:

Here are the functions to build the optimized models, build the fixed models, and to build and train the MMNN.

CNN data sequence generation

The following step constructs the Mixed Sequence to train MMNNs with epigenomic data and DNA sequence data.

Epigenomic data pre-processing

NaN imputation and robust scaling of epigenomic data.

MMNN data sequence generation

Wrapping together epigenomic and sequence data for the training and evaluation of the MMNN.

Main training loop - MMNN

Promoters

Wilcoxon test [MMNN_Promoters_Bayesian_Tuning VS MMNN_Promoters_No_Bayesian_Tuning]

Enhancers

Wilcoxon test [MMNN_Enhancers_Bayesian_Tuning VS MMNN_Enhancers_No_Bayesian_Tuning]

Complete results visualization - Barplots

Here, the results from all the different architectures designed are grouped and visualized via barplots.

Note: as stated at the start of this notebook, the code was substantially revised and customized; hence, results were stored in a different format than the one required by the barplot library used here. Therefore, the following cells are needed to format the results for the barplot visualization.

Testing

Unit tests for the major functions.